Received: by cheltenham.cs.arizona.edu; Mon, 30 Jan 1995 09:37:21 MST
To: icon-group-l@cs.arizona.edu
Date: Mon, 30 Jan 1995 15:36:49 GMT
From: goer@quads.uchicago.edu (Richard L. Goerwitz)
Message-Id: <1995Jan30.153649.11735@midway.uchicago.edu>
Organization: University of Chicago
Sender: icon-group-request@cs.arizona.edu
References: <3geb4t$cfn@magnum.convex.com>, <3gg0cv$jv8@hahn.informatik.hu-berlin.de>
Reply-To: goer@midway.uchicago.edu
Subject: Unicode (was Re: Linux, HPFS, and internationalization)
Errors-To: icon-group-errors@cs.arizona.edu
I'm cross-posting to the Icon newsgroup, because of the recent discussion
there of Unicode and internationalization issues. Followups, though, are
directed back to comp.os.linux.development.system:
In article loewis@informatik.hu-berlin.de (loewis) writes:
>
>In general, I believe internationalization efforts of Linux should
>introduce Unicode wherever reasonable - file names is one of the places.
>The question about sorting now is still: according to what rules. When
>displaying information to the user, you would like to follow the sorting
>rules of the user's native language. In the file system, all that counts
>is that you never lose accessibility, as you point out. These are two
>different things, though.
I think that this was IBM's idea when tagging directory entries for
code page. They wanted different sort orders for different "locales"
(in this case identified with codepages). Just a hunch.
One of the problems with Unicode, incidentally, is that despite all the
hoopla, information about it is being disseminated very, very slowly.
And it is doubtful that Unicode will ever displace standards like
Shift-JIS in Asia. Also, note that if localization is a concern (as in
the above posting), Unicode isn't a cure-all. Unicode is kind of a
super-ISO 8859-1 in the sense that it doesn't tell you what language or
locale you're in. So, for example, if I run into an Arabic alif in
Unicode, I really don't know whether I'm looking at Arabic, Persian, or
Urdu. The problem is the same for the so-called "CJK" languages:
Chinese, Japanese, and Korean.
This would be a good time for someone who's worked on Plan 9 to jump
in with advice. What would be a sensible way of migrating to Unicode
or other standards? Do we use UTF-8 (about which information is even
harder to come by than it is for Unicode)? Or do we use some form of wide-char
I/O, using straight Unicode? Or do we default to UTF-8 for backwards
compatibility, but provide facilities for straight Unicode?
As usual, I must confess that I'm not a software engineer. I'm in Near
Eastern Languages. But I'm following this group because the Linux
community seems unusually responsive to internationalization issues, at
least on the discussion level. (Apps. don't seem to be moving along
in this direction.) Part of the problem is that information isn't all
that widely disseminated (at least in the US) about how other scripts
and encoding systems work. Programmers just don't have the basic info
they need. And few Americans understand how Asian or Middle Eastern
or Indian scripts work (the bidirectional wordwrap algorithm, for
example, baffles many of them - not because they're dumb, but because
of simple lack of exposure to Arabic, Hebrew, etc.).
As a first step, everyone ought to peruse the
comp.software.internationalization FAQ, which at least sets forth how
to use ANSI C setlocale.
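To illustrate what setlocale buys you, here is a minimal sketch in Python, whose locale module is a thin wrapper around the very ANSI C setlocale()/strxfrm() interface the FAQ describes. The word list is made up, and the portable "C" locale is used so the behavior doesn't depend on your environment:

```python
import locale

# locale.setlocale() wraps the ANSI C call of the same name. Passing ""
# instead of "C" would adopt the user's environment (LANG, LC_ALL, ...);
# "C" is the portable default that every implementation must provide.
locale.setlocale(locale.LC_ALL, "C")

words = ["cherry", "Banana", "apple"]

# Raw byte (codepoint) order: all uppercase sorts before all lowercase
# in ASCII, which is rarely what a user wants to see.
print(sorted(words))

# Collation order of the active locale, via strxfrm(). In the "C"
# locale this coincides with byte order; in a national locale it would
# follow that language's dictionary rules instead.
print(sorted(words, key=locale.strxfrm))
```

Under a real national locale (e.g. "de_DE", where available), the second sort would interleave upper- and lowercase the way a dictionary does - which is exactly the display-time/storage-time split discussed above.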
A second step is to buy the Unicode manual, now out of date (but still
useful enough):
The Unicode Standard, Version 1.0 (two volumes), Addison-Wesley, 1990/91.
At the end of that volume is a horrendous account of the bidi wordwrap
algorithm that is guaranteed to mystify anyone who has not studied
Arabic and/or Hebrew. If anyone gets this far, and wants clean, simple
info, please contact me directly. I'll post my own informal,
programmer-oriented description of the bidi algorithm if asked.
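In the meantime, the core idea can be caricatured in a few lines. This is a deliberate oversimplification, not the real algorithm (which also handles neutrals such as the spaces *between* right-to-left words, digits, nesting levels, and mirrored punctuation). Here uppercase letters stand in for RTL script, a common convention in bidi test data:

```python
# Caricature of bidi display: text is stored in logical (reading)
# order, and each maximal run of RTL characters is reversed for
# display. Uppercase = pretend Hebrew/Arabic; everything else is LTR.

def display_order(logical):
    """Reverse each maximal run of 'RTL' (uppercase) characters."""
    out, run = [], []
    for ch in logical:
        if ch.isupper():          # stand-in for an RTL character
            run.append(ch)
        else:
            out.extend(reversed(run))   # flush the pending RTL run
            run = []
            out.append(ch)
    out.extend(reversed(run))
    return "".join(out)

print(display_order("he said ABC DEF to me"))
# -> "he said CBA FED to me"
```

Note that the real algorithm would render the two RTL words as "FED CBA" (the space between them belongs to the RTL run); handling such neutrals is exactly where the full specification gets hairy.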
For UTF-8, check out http://www.stonehand.com/unicode/standard/utf8.html.
Basically, it works as follows:
Bits  Hex Min   Hex Max   Byte Sequence in Binary
  7   00000000  0000007F  0vvvvvvv
 11   00000080  000007FF  110vvvvv 10vvvvvv
 16   00000800  0000FFFF  1110vvvv 10vvvvvv 10vvvvvv
 21   00010000  001FFFFF  11110vvv 10vvvvvv 10vvvvvv 10vvvvvv
 26   00200000  03FFFFFF  111110vv 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv
 31   04000000  7FFFFFFF  1111110v 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv

The UCS value is just the concatenation of the v bits in the multibyte
encoding. When there are multiple ways to encode a value, for example
UCS 0, only the shortest encoding is legal.
The idea is that no UTF-8 sequence can be confused with ASCII codes, since
the first 128 places (if I understand correctly) constitute a compatibility
zone. Note that some space is lost in storage, but compression takes care
of this nicely. Internally, one can use whatever one pleases (16-bit
chars are sufficient for Unicode).
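To make the table concrete, here is a sketch of an encoder for the 16-bit range discussed above (Python is used for brevity; the byte patterns come straight from the table):

```python
# Sketch of UTF-8 encoding for UCS values up to 0xFFFF, following the
# table above: 0vvvvvvv / 110vvvvv 10vvvvvv / 1110vvvv 10vvvvvv 10vvvvvv.

def utf8_encode(cp):
    if cp < 0x80:                       # 7 bits: ASCII passes through
        return bytes([cp])
    elif cp < 0x800:                    # 11 bits: two bytes
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    elif cp < 0x10000:                  # 16 bits: three bytes
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("beyond the 16-bit range shown here")

# ASCII is encoded as itself, so an ASCII file is already valid UTF-8:
assert utf8_encode(0x41) == b"A"
# The Arabic alif, U+0627, becomes the two-byte sequence D8 A7; both
# bytes have the high bit set, so neither can be mistaken for ASCII:
assert utf8_encode(0x0627) == b"\xd8\xa7"
```

Decoding is the mirror image: the leading byte's high bits announce the sequence length, and the v bits are concatenated back into the UCS value, rejecting any non-shortest form.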
I hope this helps. The Linux community is an interesting, energetic
bunch.
--
Richard L. Goerwitz *** goer@midway.uchicago.edu